data organization & errors

by: Floyd, 9 years ago

Last edited: 9 years ago

Harrison, I have two issues. 1. I can print the data from my code, but when I try to save it I get either HTTP Error 404: Not Found or only one QB's stats in the file. 2. I'm having trouble visualizing how to organize my data. I'm sure it's something simple that you can assist with.
Here is my code:
[import re
import urllib.request
import urllib.parse

def qb():
    
    try:
        url = 'http://sports.yahoo.com/nfl/stats/byposition?'
        values = {'pos':'QB','conference':'NFL','year':'season_2015',
                  'timeframe':'To Date','sort':'49'}

        headers = {}
        headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"

        data = urllib.parse.urlencode(values)
        data = data.encode('utf-8')

        req = urllib.request.Request(url,data,headers=headers)
        resp = urllib.request.urlopen(req)
        respData = resp.read()
        
        try:
            links = re.findall(r'<a href="(.*?)"',str(respData))
            for link in links:
                if '/nfl/players/' not in link:
                    pass
                else:
                    revUrl = 'http://sports.yahoo.com'+link
                    revReq = urllib.request.Request(revUrl,headers=headers)
                    openRevLink = urllib.request.urlopen(revReq)
                    linkSource = openRevLink.read()
                    try:
                        content = re.findall(r'<div class="yom-mod yom-app yom-data yom-sports-game-log nfl" id="mediasportsplayergamelog">(.*?)<div class="yom-mod yom-app yom-data yom-sports-career-stats " id="mediasportsplayercareerstats">',str(linkSource))
                        plaNum = re.findall(r'<li class="active"><a title="" href="/nfl/players/(.*?)/">',str(linkSource))
                        plaName = re.findall(r'<h1 data-name=".*?" data-url="http://sports.yahoo.com/nfl/players/.*?/">(.*?)</h1>',str(linkSource))
                        gameDate = re.findall(r'<th.*?class="date date".*?>(.*?)</th>',str(content))
                        gameOpp = re.findall(r'<td class="opponent opponent">(.*?)</td>',str(content))
                        gameScore = re.findall(r'<td class="score score"><a href=.*?data-ylk=.*?>(.*?)</a></td>',str(content))
                        gameData = re.findall(r'<td class="nfl-stat-type-.*?" title="(.*?)">(.*?)</td>',str(content))

                        allData = [link,plaNum,plaName,gameDate,gameOpp,gameScore,gameData]
                        #print(allData)

                        # Append ('a') so every player's data is kept; 'w' would overwrite the file on each pass of the loop
                        with open('c:/Users/Documents/dkStats/playerData.csv','a') as saveFile:
                            saveFile.write(str(allData) + '\n')

                    except Exception as e:
                        print(str(e))
        except Exception as e:
            print(str(e))
                
    except Exception as e:
        print(str(e))
        
qb()]
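
On the second issue (organizing the data), one possible layout is one CSV row per game, built by zipping the parallel lists that re.findall returns. This is only a minimal sketch: build_rows, write_rows, the column names, and the sample values are all made up for illustration, and it assumes the date, opponent, and score lists line up one-to-one. Note that re.findall returns a list, so single values such as plaNum and plaName would need to be taken with [0] first. Opening a file in 'w' mode inside a loop replaces its contents on each pass, so either append ('a') or collect all rows and write them once.

```python
import csv
import io

def build_rows(player_id, player_name, dates, opponents, scores):
    """Zip the parallel per-game lists into one row per game."""
    return [
        [player_id, player_name, date, opp, score]
        for date, opp, score in zip(dates, opponents, scores)
    ]

def write_rows(out, rows, header=None):
    """Write an optional header and the rows to an open text file object."""
    writer = csv.writer(out)
    if header:
        writer.writerow(header)
    writer.writerows(rows)

# Made-up sample values standing in for one player's scraped lists.
rows = build_rows('1234', 'Sample Player',
                  ['Sep 13', 'Sep 20'],
                  ['@ AAA', 'BBB'],
                  ['W 20-10', 'L 7-14'])

# An in-memory buffer keeps the sketch self-contained; for a real file,
# open(path, 'a', newline='') could be passed in instead.
buffer = io.StringIO()
write_rows(buffer, rows,
           header=['player_id', 'name', 'date', 'opponent', 'score'])
print(buffer.getvalue())
```

The per-stat values in gameData come back as (title, value) pairs and are not handled here; they would need to be grouped per game before being added as extra columns.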






I think I found the cause of my 404 error: my regular expression was sloppy, and I've cleaned it up since my original post. I've also started getting a 502 server hangup error, which I think I solved by adding time to my requests. It at least let me finish the last scrape. Now all that's left is to put the data into an organized list.
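
For anyone hitting the same 502, here is a rough sketch of what "adding time" could look like: sleep before each request and retry a few times when the server answers with a 5xx error. The function name fetch_with_pause and its pause, retries, opener, and sleep parameters are hypothetical, not part of the original script; the two-second default is an arbitrary guess, not a documented limit for the site.

```python
import time
import urllib.error
import urllib.request

def fetch_with_pause(url, headers=None, pause=2.0, retries=3,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, sleeping `pause` seconds before each attempt.

    Retries on HTTP 5xx responses (such as 502) up to `retries` attempts;
    other HTTP errors, and the final failed attempt, are re-raised.
    """
    req = urllib.request.Request(url, headers=headers or {})
    for attempt in range(1, retries + 1):
        sleep(pause)  # be gentle with the server between requests
        try:
            with opener(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            # Give up on client errors (4xx) or once attempts run out.
            if e.code < 500 or attempt == retries:
                raise

# Possible use inside the scraping loop (revUrl and headers as in the script):
#     linkSource = fetch_with_pause(revUrl, headers=headers)
```

The opener and sleep arguments exist only so the function can be tried out without touching the network or actually waiting.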

-Floyd 9 years ago
